Abstract

We develop the Latent Multi-group Membership 
Graph (LMMG) model, a model of networks 
with rich node feature structure. In the LMMG 
model, each node belongs to multiple groups and 
each latent group models the occurrence of links 
as well as the node feature structure. The LMMG 
can be used to summarize the network structure, 
to predict links between the nodes, and to pre- 
dict missing features of a node. We derive effi- 
cient inference and learning algorithms and eval- 
uate the predictive performance of the LMMG on 
several social and document network datasets. 



1. Introduction 

Network data, such as social networks of friends, cita- 
tion networks of documents, and hyper-linked networks of 
webpages, play an increasingly important role in modern 
machine learning applications. Analyzing network data 
provides useful predictive models for recommending new 
friends in social networks (Backstrom & Leskovec, 2011) 
or scientific papers in document networks (Nallapati et al., 
2008; Chang & Blei, 2009). 

Research on networks has focused on various models
of network link structure. Latent variable models
(Airoldi et al., 2007; Hoff et al., 2002; Kemp et al.,
2006) decompose a network according to hidden patterns
of connections between the nodes, while models
based on Kronecker products (Leskovec et al., 2010;
Kim & Leskovec, 2012; 2011a) accurately model the
global network structure. Though powerful, these models
account only for the structure of the network, while ignoring
observed features of the nodes. For example, in social
networks users have profile information, and in document
networks each node also contains the text of the document
that it represents. Such models can find patterns which
account for the connections between nodes, but they cannot
account for the node features.

Node features along with the links between them provide 
rich and complementary sources of information and should 
be used simultaneously for uncovering, understanding and 
exploiting the latent structure in the data. In this respect, we 
develop a new network model considering both the emer- 
gence of links of the network and the structure of node fea- 
tures such as user profile information or text of a document. 

Considering both sources of data, links and node features, 
leads to more powerful models than those that only con- 
sider links. For example, given a new node with a few 
of its links, traditional network models provide a predic- 
tive distribution of nodes to which it might be connected. 
However, to predict links of a node, our model does not
need to see any links of that node. It can predict links using
only the node's features. For example, we can suggest a user's
friendships based only on the profile information, or
recommend hyperlinks of a webpage based only on its
textual information. Moreover, given a new node and its links,
our model also provides a predictive distribution of node
features. This can be used to predict features of a node
given its links or even predict missing or hidden features
of a node given its links. For example, in our model a user's
interests or keywords of a webpage can be predicted using
only the connections of the network. Such predictions are
out of reach for traditional models of networks.

We develop a Latent Multi-group Membership Graph
(LMMG) model of networks that explicitly ties nodes into
groups of shared features and linking structure (Figure 1).
Nodes belong to multiple latent groups and the occurrence
of each node feature is determined by a logistic model
based on the group memberships of the given node. Links
of the network are then generated via link-affinity matrices.
Each link-affinity matrix $\Theta_i$ represents a table of link
probabilities, and an appropriate entry of $\Theta_i$ is chosen based
on whether or not a pair of nodes share the membership
in group i. We derive effective algorithms for model
parameter estimation and prediction. We study the performance
of LMMG on real-world social and document networks.
We investigate the predictive performance on three
different tasks: link prediction, node feature prediction, and
supervised node classification. The LMMG provides
significantly better performance on all three tasks than natural
alternatives and the current state of the art.

Figure 1. Latent Multi-group Membership Graph model. A node
belongs to multiple latent groups at once. Based on group
memberships, features of a node are generated using a logistic model.
Links are modeled via link-affinity matrices, which allow for rich
interactions between members and non-members of groups.

2. LMMG Model Formulation 

The Latent Multi-group Membership Graph (LMMG)
model is a model of a (directed or undirected) network and
nodes which have categorical features. Our model contains
two important ingredients or innovations (see Figure 1).

First, the model assigns nodes to latent groups and allows
nodes to belong to multiple groups at once. In contrast to
multinomial models of group membership (Airoldi et al.,
2007; Chang & Blei, 2009), where the membership of a
node is shared among the groups (the probability over
group memberships of a node sums to 1), we model group
memberships as a series of Bernoulli random variables
($\phi_i$ in Figure 1), which indicates that nodes in our model
can truly belong to multiple groups. Hence, in contrast
to multinomial topic models, a higher probability of node
membership to a group does not necessarily lead to a lower
probability of membership to some other group in the LMMG.

Second, for modeling the links of the network, each group
k has an associated link-affinity matrix ($\Theta$ in Figure 1). Each
link-affinity matrix represents a table of link probabilities
given that a pair of nodes belongs or does not belong to
group k. Thus, depending on the combination of the
memberships of nodes to group k, an appropriate element of
$\Theta_k$ is chosen. For example, the entry (0, 0) of $\Theta_k$ captures the
link-affinity when neither of the nodes belongs to group k,
while (1, 0) stores the link-affinity when the first node belongs
to the group but the second does not. As we will show later,
this allows for rich flexibility in modeling the links of
the network as well as for uncovering and understanding
the latent structure in the network data.

Figure 2. Plate model representation of LMMG model.

Now we formalize the LMMG model illustrated in Figure 2
and describe it in a generative way. Formally, each
node $i = 1, 2, \ldots, N$ has a real-valued group membership
$\phi_{ik} \in [0, 1]$ for each group $k = 1, 2, \ldots, K$. $\phi_{ik}$ represents
the probability that node i belongs to group k. Assuming
the Beta distribution parameterized by $\alpha_{k1}, \alpha_{k2}$ as a prior
distribution of group membership $\phi_{ik}$, we model the latent
group assignment $z_{ik}$ for each node as follows:

$$
\begin{aligned}
\phi_{ik} &\sim \mathrm{Beta}(\alpha_{k1}, \alpha_{k2}) \\
z_{ik} &\sim \mathrm{Bernoulli}(\phi_{ik}) \quad \text{for } k = 1, 2, \ldots, K.
\end{aligned}
\tag{1}
$$



Since each group membership of a node is independent, 
a node can belong to multiple groups simultaneously. 
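As an illustration of Equation (1), the following is a minimal sketch of the membership sampling step using only the Python standard library. The function name, the hyperparameter values, and the list-of-lists layout are assumptions made for this example and are not taken from the paper.

```python
import random

def sample_memberships(alpha, n_nodes, rng):
    """Sample phi[i][k] ~ Beta(a_k1, a_k2) and z[i][k] ~ Bernoulli(phi[i][k]).

    alpha is a list of (a_k1, a_k2) pairs, one pair per latent group k.
    """
    phi, z = [], []
    for _ in range(n_nodes):
        # Each group membership is drawn independently of the others.
        phi_i = [rng.betavariate(a1, a2) for (a1, a2) in alpha]
        z_i = [1 if rng.random() < p else 0 for p in phi_i]
        phi.append(phi_i)
        z.append(z_i)
    return phi, z

# Hypothetical hyperparameters for K = 3 groups (illustrative values only).
example_alpha = [(1.0, 1.0), (2.0, 5.0), (0.5, 0.5)]
example_phi, example_z = sample_memberships(example_alpha, 4, random.Random(0))
```

Because each $z_{ik}$ is drawn separately, a row of `example_z` may contain several ones, which is the multi-group behavior the text describes.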

The group memberships of a node affect both node features
and its links. With respect to node features, we limit our
focus to binary-valued features and use a logistic function
to model the occurrence of a node's features based on the
groups it belongs to. For each feature $F_{il}$ of node i ($l =
1, \ldots, L$), we consider a separate logistic model where we
regard group memberships $\phi_{i1}, \ldots, \phi_{iK}$ as input features
of the model. In this way, the logistic model represents the
relevance of each group membership to the presence of a
node feature. For convenience, we refer to the input vector
of node i for the logistic model as $\phi_i = [\phi_{i1}, \ldots, \phi_{iK}, 1]$,
where $\phi_{i(K+1)} = 1$ represents the intercept term. Then,

$$
\begin{aligned}
y_{il} &= \frac{1}{1 + \exp(-w_l^{T} \phi_i)} \\
F_{il} &\sim \mathrm{Bernoulli}(y_{il}) \quad \text{for } l = 1, 2, \ldots, L
\end{aligned}
\tag{2}
$$

where $w_l \in \mathbb{R}^{K+1}$ is the logistic model parameter for the
l-th node feature. The value of each $w_{lk}$ indicates the
contribution of group k to the presence of node feature l.
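The feature model of Equation (2) can be sketched in a few lines. This is an illustrative sketch only: the function names and the row-per-feature layout of W (each row holding K weights followed by the intercept) are assumptions of this example.

```python
import math
import random

def feature_probs(phi_i, W):
    """Return y_il = 1 / (1 + exp(-w_l . [phi_i, 1])) for every feature l."""
    x = list(phi_i) + [1.0]              # append the intercept input
    probs = []
    for w_l in W:                         # w_l holds K weights plus the intercept
        s = sum(w * v for w, v in zip(w_l, x))
        probs.append(1.0 / (1.0 + math.exp(-s)))
    return probs

def sample_features(phi_i, W, rng):
    """Draw F_il ~ Bernoulli(y_il) independently for each feature l."""
    return [1 if rng.random() < y else 0 for y in feature_probs(phi_i, W)]
```

A large positive weight on group k raises the probability of the feature for nodes with high membership in group k, while the intercept sets the base rate.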

Figure 3. Link structures modeled by link-affinity matrices:
(a) homophily, (b) heterophily, (c) core-periphery.

In order to model the links of the network, we build on
the idea of the Multiplicative Attributes Random Graph
(MAG) model (Kim & Leskovec, 2012). Here each latent
group k has an associated link-affinity matrix $\Theta_k \in
[0, 1]^{2 \times 2}$. Each entry of the link-affinity matrix indicates a
tendency of linking between a pair of nodes depending on
whether they belong to the group k or not. In other words,
given the group assignments $z_{ik}$ and $z_{jk}$ of nodes i and j,
$z_{ik}$ "selects" a row and $z_{jk}$ "selects" a column of $\Theta_k$,
so that the linking tendency from node i to node j is
captured by $\Theta_k[z_{ik}, z_{jk}]$. After acquiring such link-affinities
from all the groups, we define the link probability $p_{ij}$ as
the product of the link-affinities. Therefore, based on latent
group assignments and link-affinity matrices, we determine
each entry of the adjacency matrix $A \in \{0, 1\}^{N \times N}$ of the
network as follows:

$$
\begin{aligned}
p_{ij} &= \prod_{k} \Theta_k[z_{ik}, z_{jk}] \\
A_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \quad \text{for } i, j = 1, 2, \ldots, N.
\end{aligned}
\tag{3}
$$
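To make the link model of Equation (3) concrete, here is a small sketch in which each link-affinity matrix is a nested 2x2 list indexed as `theta[z_i][z_j]`. The function names are hypothetical, and self-pairs (i = j) are sampled like any other pair, following the index range in Equation (3) literally.

```python
import random

def link_prob(z_i, z_j, thetas):
    """p_ij = product over groups k of Theta_k[z_ik][z_jk]."""
    p = 1.0
    for zi, zj, theta in zip(z_i, z_j, thetas):
        p *= theta[zi][zj]     # z_ik selects the row, z_jk selects the column
    return p

def sample_adjacency(Z, thetas, rng):
    """Draw A_ij ~ Bernoulli(p_ij) for every ordered pair (i, j)."""
    n = len(Z)
    return [[1 if rng.random() < link_prob(Z[i], Z[j], thetas) else 0
             for j in range(n)]
            for i in range(n)]
```

Since $p_{ij}$ is a product over all K groups, a single low affinity entry is enough to make a link unlikely, regardless of the other groups.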

The network model parameter $\Theta_k$ represents the link
affinity with respect to the particular group k. The model offers
flexibility in the sense that it can represent many types of
linking structures. As Figure 3 shows, by varying the link-affinity
matrix, the model can capture heterophily (love of the
different), homophily (love of the same), or core-periphery
structure. This way the affinity matrix allows us to discover
the effects of node features on links of the network.

The node feature and the network models are connected
via group memberships $\phi_i$. For instance, suppose that
$w_{lk}$ is large for some feature l and group k. Then, as
the node i belongs to group k with high probability ($\phi_{ik}$
is close to 1), the feature l of node i, $F_{il}$, is more likely
to be 1. By modeling group memberships using multiple
Bernoulli random variables (instead of using a multinomial
distribution (Airoldi et al., 2007; Chang & Blei, 2009)), we
achieve greater modeling flexibility, which allows for
making predictions about links given features and features
given links. In Section 4, we empirically demonstrate that
the LMMG outperforms traditional models on these tasks.

Moreover, if we divide the nodes of the network into two
sets depending on the membership to group k, then we can
discover how members of group k link to other members
as well as non-members of k, based on the structure of
$\Theta_k$. For example, when $\Theta_k$ has large values on diagonal
entries like in Figure 3(a), members or non-members are
likely to link among themselves, while there is low affinity
for links between members and non-members. Figure 3(b)
captures exactly the opposite behavior, where links are most
likely between members and non-members. The
core-periphery structure is captured by the link-affinity matrix
in Figure 3(c), where nodes that share group memberships
(the "core") are most likely to link, while nodes in the
periphery are least likely to link among themselves.
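The three patterns described above can be told apart by comparing the entries of a 2x2 link-affinity matrix. The sketch below is one possible rule of thumb; the strict inequalities it uses and the numeric example matrices are assumptions of this illustration and are not values reported in the paper.

```python
def describe_affinity(theta):
    """Label a 2x2 link-affinity matrix; index 1 = member, 0 = non-member."""
    within = (theta[0][0], theta[1][1])    # non-member pairs, member pairs
    across = (theta[0][1], theta[1][0])    # mixed member / non-member pairs
    if min(within) > max(across):
        return "homophily"
    if min(across) > max(within):
        return "heterophily"
    if theta[1][1] > max(across) and min(across) > theta[0][0]:
        return "core-periphery"
    return "other"

# Hypothetical matrices chosen only to exercise each branch.
homophily_example = [[0.8, 0.1], [0.1, 0.9]]
heterophily_example = [[0.1, 0.8], [0.8, 0.1]]
core_periphery_example = [[0.05, 0.4], [0.4, 0.9]]
```

A nearly flat matrix falls into the "other" branch, which corresponds to a group that carries little information about link structure.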

3. Inference, Estimation and Prediction 

We now turn our attention to LMMG model estimation.
Given a set of binary node features F and the network A,
we aim to find node group memberships $\phi$, parameters W
of the node feature model, and link-affinity matrices $\Theta$.



3.1. Problem formulation 

When the node features $F = \{F_{il} : i = 1, \ldots, N,\ l =
1, \ldots, L\}$ and the adjacency matrix $A \in \{0, 1\}^{N \times N}$ are
given, we aim to find the group memberships $\phi = \{\phi_{ik} :
i = 1, \ldots, N,\ k = 1, \ldots, K\}$, the logistic model parameters
$W = \{w_{lk} : l = 1, \ldots, L,\ k = 1, \ldots, K + 1\}$,
and the link-affinity matrices $\Theta = \{\Theta_k : k = 1, \ldots, K\}$.
We apply maximum likelihood estimation, which finds
the optimal values of $\phi$, W, and $\Theta$ so that they maximize
the likelihood $P(F, A, \phi \mid W, \Theta, \alpha)$, where $\alpha$ represents the
hyperparameters, $\alpha = \{(\alpha_{k1}, \alpha_{k2}) : k = 1, \ldots, K\}$, for the
Beta prior distributions. In the end, we aim to solve

$$
\max_{\phi, W, \Theta} \log P(F, A, \phi \mid W, \Theta, \alpha).
\tag{4}
$$



Now we compute the objective function in the above
optimization problem. Since the LMMG independently
generates F and A given group memberships $\phi$, we decompose
the log-likelihood $\log P(F, A, \phi \mid W, \Theta, \alpha)$ as follows:

$$
\log P(F, A, \phi \mid W, \Theta, \alpha)
= \log P(F \mid \phi, W) + \log P(A \mid \phi, \Theta) + \log P(\phi \mid \alpha).
\tag{5}
$$

Hence, to compute $\log P(F, A, \phi \mid W, \Theta, \alpha)$, we separately
calculate each term of Equation (5). We obtain $\log P(\phi \mid \alpha)$
and $\log P(F \mid \phi, W)$ from Equations (1) and (2):

$$
\begin{aligned}
\log P(\phi \mid \alpha) &= \sum_{i,k} (\alpha_{k1} - 1) \log \phi_{ik}
+ \sum_{i,k} (\alpha_{k2} - 1) \log (1 - \phi_{ik}) \\
\log P(F \mid \phi, W) &= \sum_{i,l} F_{il} \log y_{il} + (1 - F_{il}) \log (1 - y_{il})
\end{aligned}
$$

where $y_{il}$ is defined in Equation (2).
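The two closed-form terms above translate directly into code. The following sketch evaluates them for small inputs; the function names and the nested-list layout are assumptions of this example, and no guard is included for memberships or probabilities that are exactly 0 or 1.

```python
import math

def log_prior(phi, alpha):
    """Sum over i, k of (a_k1 - 1) log phi_ik + (a_k2 - 1) log(1 - phi_ik)."""
    total = 0.0
    for phi_i in phi:
        for p, (a1, a2) in zip(phi_i, alpha):
            total += (a1 - 1.0) * math.log(p) + (a2 - 1.0) * math.log(1.0 - p)
    return total

def feature_loglik(phi, W, F):
    """Sum over i, l of F_il log y_il + (1 - F_il) log(1 - y_il)."""
    total = 0.0
    for phi_i, F_i in zip(phi, F):
        x = list(phi_i) + [1.0]                       # intercept input
        for w_l, f in zip(W, F_i):
            s = sum(w * v for w, v in zip(w_l, x))
            y = 1.0 / (1.0 + math.exp(-s))            # y_il from Equation (2)
            total += f * math.log(y) + (1 - f) * math.log(1.0 - y)
    return total
```

With a flat Beta(1, 1) prior the first term vanishes, so the memberships are then driven only by the feature and link terms.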

With regard to the second term in Equation (5),

$$
\log P(A \mid \phi, \Theta) = \log \sum_{Z} P(A \mid Z, \phi, \Theta)\, P(Z \mid \phi, \Theta)
\tag{6}
$$

for $Z = \{z_{ik} : i = 1, \ldots, N,\ k = 1, \ldots, K\}$. We note
that A is independent of $\phi$ given Z. To exactly calculate
$\log P(A \mid \phi, \Theta)$, we thus sum $P(A \mid Z, \Theta) P(Z \mid \phi)$ over every
instance of Z given $\Theta$ and $\phi$, but this requires the sum over
$2^{NK}$ instances. As this exact computation is infeasible, we
approximate $\log P(A \mid \phi, \Theta)$ using its lower bound obtained
by applying Jensen's Inequality to Equation (6):

$$
\begin{aligned}
\log P(A \mid \phi, \Theta) &= \log \mathbb{E}_{Z \sim \phi} [P(A \mid Z, \Theta)] \\
&\geq \mathbb{E}_{Z \sim \phi} [\log P(A \mid Z, \Theta)]
\end{aligned}
\tag{7}
$$

Now that we are summing up over $N^2$ terms, the
computation of the lower bound is feasible. We thus
maximize the lower bound $\mathcal{L}$ of the log-likelihood
$\log P(A, F, \phi \mid W, \Theta, \alpha)$. To sum up, we aim to solve

$$
\min_{\phi, W, \Theta} -(L_\phi + L_F + L_A) + \lambda |W|_1
\tag{8}
$$

where $L_\phi = \log P(\phi \mid \alpha)$, $L_F = \log P(F \mid \phi, W)$, and
$L_A = \mathbb{E}_{Z \sim \phi} [\log P(A \mid Z, \Theta)]$. To avoid overfitting, we
regularize the objective function by the L1-norm of W.
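The effect of Jensen's inequality can be checked numerically for a single present link between two distinct nodes. Because $\log p_{ij}$ is a sum over groups, its expectation under independent Bernoulli memberships has a simple closed form, which the sketch below compares with a brute-force evaluation of $\log \mathbb{E}[p_{ij}]$. The function names are hypothetical, and only the $A_{ij} = 1$ term is covered: the $\log(1 - p_{ij})$ term does not decompose this way, and the paper defers its computation to the Appendix.

```python
import itertools
import math

def expected_log_p(phi_i, phi_j, thetas):
    """E_{Z~phi}[log p_ij] for i != j, using linearity of expectation."""
    total = 0.0
    for pi, pj, theta in zip(phi_i, phi_j, thetas):
        for a, pa in ((0, 1.0 - pi), (1, pi)):
            for b, pb in ((0, 1.0 - pj), (1, pj)):
                total += pa * pb * math.log(theta[a][b])
    return total

def log_expected_p(phi_i, phi_j, thetas):
    """log E_{Z~phi}[p_ij] by enumerating every (z_i, z_j), 2^(2K) terms."""
    K = len(thetas)
    total = 0.0
    for z_i in itertools.product((0, 1), repeat=K):
        for z_j in itertools.product((0, 1), repeat=K):
            weight, p = 1.0, 1.0
            for k in range(K):
                weight *= phi_i[k] if z_i[k] else 1.0 - phi_i[k]
                weight *= phi_j[k] if z_j[k] else 1.0 - phi_j[k]
                p *= thetas[k][z_i[k]][z_j[k]]
            total += weight * p
    return math.log(total)
```

The first function never exceeds the second, and the two coincide when every membership is exactly 0 or 1, since Z is then deterministic.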

3.2. Parameter estimation 

To solve the problem in Equation (8), we alternately
update the group memberships $\phi$, the model parameters W,
and $\Theta$. Once $\phi$, W, and $\Theta$ are initialized, we first update
the group memberships $\phi$ to maximize $\mathcal{L}$ while fixing W
and $\Theta$. We then update the model parameters W and $\Theta$ to
minimize the function $(-\mathcal{L} + \lambda |W|_1)$ in Equation (8) while
fixing $\phi$. Note that $\mathcal{L}$ is decomposed into $L_A$, $L_F$, and $L_\phi$.
Therefore, when updating W and $\Theta$ given $\phi$, we separately
maximize the corresponding log-likelihoods $L_F$ and $L_A$.
We repeat this alternate updating procedure until the
solution converges. In the following we describe the details.

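Update of group memberships $\phi$. Now we focus on the
update of group membership $\phi$ given the model parameters
W and $\Theta$. We use the coordinate ascent algorithm, which
updates each membership $\phi_{ik}$ by fixing the others so as to
maximize the lower bound $\mathcal{L}$. By computing the derivatives
of $L_\phi$, $L_F$, and $L_A$ we apply the gradient method to
update each $\phi_{ik}$:

$$
\begin{aligned}
\frac{\partial L_\phi}{\partial \phi_{ik}} &= \frac{\alpha_{k1} - 1}{\phi_{ik}} - \frac{\alpha_{k2} - 1}{1 - \phi_{ik}} \\
\frac{\partial L_F}{\partial \phi_{ik}} &= \sum_{l} (F_{il} - y_{il})\, w_{lk} \\
\frac{\partial L_A}{\partial \phi_{ik}} &=
\sum_{j : A_{ij} = 1} \frac{\partial\, \mathbb{E}_{Z \sim \phi} [\log p_{ij}]}{\partial \phi_{ik}}
+ \sum_{j : A_{ji} = 1} \frac{\partial\, \mathbb{E}_{Z \sim \phi} [\log p_{ji}]}{\partial \phi_{ik}} \\
&\quad + \sum_{j : A_{ij} = 0} \frac{\partial\, \mathbb{E}_{Z \sim \phi} [\log (1 - p_{ij})]}{\partial \phi_{ik}}
+ \sum_{j : A_{ji} = 0} \frac{\partial\, \mathbb{E}_{Z \sim \phi} [\log (1 - p_{ji})]}{\partial \phi_{ik}}
\end{aligned}
\tag{9}
$$

where $F_{il}$ is either 0 or 1, and $y_{il}$ and $p_{ij}$ are respectively
defined in Equations (2) and (3). For brevity, we
describe the details of Equation (9) in the Appendix. Hence,
by adding up $\partial L_\phi / \partial \phi_{ik}$, $\partial L_F / \partial \phi_{ik}$, and
$\partial L_A / \partial \phi_{ik}$, we complete computing the derivative of the
lower bound of the log-likelihood and update the group
membership $\phi_{ik}$ using the gradient method:

$$
\phi_{ik}^{new} = \phi_{ik}^{old} + \gamma_\phi \left(
\frac{\partial L_\phi}{\partial \phi_{ik}} + \frac{\partial L_F}{\partial \phi_{ik}} + \frac{\partial L_A}{\partial \phi_{ik}} \right)
\tag{10}
$$

for a given learning rate $\gamma_\phi$. By updating each $\phi_{ik}$ in turn
while fixing the others, we can find the optimal group
memberships $\phi$ given the model parameters W and $\Theta$.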
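The prior and feature parts of the membership gradient are simple enough to sketch directly. The code below is a partial illustration under stated assumptions: it omits the link term $\partial L_A / \partial \phi_{ik}$ (whose computation the paper leaves to the Appendix), the function names are hypothetical, and the clipping of the updated membership to the open interval (0, 1) is a safeguard added here rather than something the text specifies.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_prior_and_features(phi_i, k, alpha_k, W, F_i):
    """dL_phi/dphi_ik + dL_F/dphi_ik for one node i and one group k."""
    a1, a2 = alpha_k
    p = phi_i[k]
    g_prior = (a1 - 1.0) / p - (a2 - 1.0) / (1.0 - p)
    x = list(phi_i) + [1.0]                     # intercept input
    g_feat = 0.0
    for w_l, f in zip(W, F_i):
        y = sigmoid(sum(w * v for w, v in zip(w_l, x)))
        g_feat += (f - y) * w_l[k]              # (F_il - y_il) * w_lk
    return g_prior + g_feat

def ascent_step(phi_i, k, grad, rate, eps=1e-6):
    """One coordinate step; clipping to [eps, 1 - eps] is an added assumption."""
    updated = list(phi_i)
    updated[k] = min(1.0 - eps, max(eps, phi_i[k] + rate * grad))
    return updated
```

In a full implementation the link gradient would be added to `grad` before calling `ascent_step`, and the step would be repeated for each coordinate in turn.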

Update of node feature model parameters W. Now we
update the parameters for the node feature model, W, while
group memberships $\phi$ are fixed. Note that given the group
membership $\phi$ the node feature model and the network
model are independent of each other. Therefore, finding
the parameter W is identical to running L1-regularized
logistic regression given input $\phi$ and output F data, as we
penalize the objective function in Equation (8) on the L1
value of the model parameter W. We basically use the
gradient method to update W but make it sparse by applying
a technique similar to LASSO:

$$
\begin{aligned}
\frac{\partial L_F}{\partial w_{lk}} &= \sum_{i} (F_{il} - y_{il})\, \phi_{ik} \\
w_{lk}^{new} &= w_{lk}^{old} + \gamma_F \left( \frac{\partial L_F}{\partial w_{lk}} - \lambda(k)\, \mathrm{Sign}(w_{lk}) \right)
\end{aligned}
\tag{11}
$$

if $w_{lk}^{old} \neq 0$ or $|\partial L_F / \partial w_{lk}| > \lambda(k)$, where $\lambda(k) = \lambda$ for
$k = 1, \ldots, K$ and $\lambda(K + 1) = 0$ (i.e., we do not regularize on
the intercepts), and $\gamma_F$ is a constant learning rate. Furthermore,
if $w_{lk}$ crosses 0 while being updated, we assign 0 to $w_{lk}$ as
LASSO does. By this procedure, we can update the node
feature model parameter W to maximize the lower bound
of the log-likelihood $\mathcal{L}$ as well as to maintain a small number
of relevant groups for each node feature.
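A literal reading of the update rule in Equation (11) for a single weight can be sketched as follows. The function names are hypothetical, and treating Sign(0) as 0 is an assumption of this example, since the text does not say how the sign of a zero weight is handled.

```python
def sign(x):
    """Return -1, 0, or 1; Sign(0) = 0 is an assumption of this sketch."""
    return (x > 0) - (x < 0)

def l1_update(w_old, grad, rate, lam):
    """One step of Equation (11) for one weight w_lk.

    grad is dL_F/dw_lk and lam is lambda(k); pass lam = 0.0 for the intercept.
    """
    # A zero weight stays at zero unless the gradient exceeds the penalty.
    if w_old == 0.0 and abs(grad) <= lam:
        return 0.0
    w_new = w_old + rate * (grad - lam * sign(w_old))
    # If the update crosses zero, clamp the weight to zero.
    if w_old * w_new < 0.0:
        return 0.0
    return w_new
```

The two zeroing conditions are what keep W sparse: weights that are not supported by a strong enough gradient either never leave zero or are snapped back to it.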

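Update of network model parameters $\Theta$. Next we focus
on updating the network model parameters, $\Theta$, where the
group membership $\phi$ is also fixed. Again, note that the network
model is independent of the node feature model given the
group membership $\phi$, so we do not need to consider $L_\phi$ or
$L_F$. We thus update $\Theta$ to maximize $L_A$ given $\phi$ using the
gradient method:

$$
\begin{aligned}
\nabla_{\Theta_k} L_A &\approx \nabla_{\Theta_k} \mathbb{E}_{Z \sim \phi} \left[
\sum_{A_{ij} = 1} \log p_{ij} + \sum_{A_{ij} = 0} \log (1 - p_{ij}) \right] \\
\Theta_k^{new} &= \Theta_k^{old} + \gamma_A \nabla_{\Theta_k} L_A
\end{aligned}
$$

for a constant learning rate $\gamma_A$. We explain the computation
of $\nabla_{\Theta_k} \mathbb{E}_{Z \sim \phi} \log p_{ij}$ and $\nabla_{\Theta_k} \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})$ in detail
in the Appendix.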
3.3. Prediction

With a fitted model, our ultimate goal is to make predictions
about new data. In real-world applications, the
node features are often missing. Our algorithm is able to
handle such missing node features by fitting the LMMG
only to the observed features. In other words, when we
update the group membership $\phi$ or the feature model
parameter W by the gradient method from Equations (9) and
(11), we only average the terms corresponding to the
observed data. For example, when there is missing feature
data, the feature term of Equation (9) is computed only
over the observed data O.

Similarly, for link prediction we modify the model estimation
method as follows. While updating the node feature
model parameters W based on the features of all the nodes
including a new node, we estimate the network model
parameters $\Theta$ only on the observed network by holding out
the new node. This way, the observed features naturally
update the group memberships of a new node, and we can
predict the missing node features or network links by using the
estimated group memberships and model parameters.

4. Experiments 

Here we perform experiments to evaluate our model. First, 
we run the various prediction tasks: missing node feature 
prediction, missing link prediction, and supervised node 
classification. In all tasks our model outperforms natural 
baselines. Second, we qualitatively analyze the relation- 
ships between node features and network structure by a 
case study of a Facebook ego-network and show how the 
LMMG identifies useful and interpretable latent structures. 

Datasets. For our experiments, we used the following 
datasets containing networks and node features. 

• AddHealth (AH): School friendship network (458
nodes, 2,130 edges) with 35 school-related node
features such as GPA, courses taken, and placement
(Bearman et al., 1997).

• Egonet (EGO): Facebook ego-network of a particular
user (227 nodes, 6,348 edges) and 14 binary features
(e.g. same high school, same age, and sports club),
manually assigned to each friend by the user.

• Facebook100 (FB): Facebook network of Caltech
(769 nodes, 33,312 edges) and 24 university-related
node features like major, gender, and dormitory
(Traud et al., 2011).

• WebKB (WKB): Hyperlinks between computer science
webpages of Cornell University in the WebKB
dataset (195 nodes, 304 edges). We use occurrences
of 993 words as binary features (Craven et al., 1998).

Figure 4. Three link and feature based predictive tasks:
(a) missing feature prediction, (b) missing link prediction,
(c) supervised node classification.

We binarized discrete valued features (e.g. school year) 
based on whether the feature value is greater than the 
median value. For the non-binary categorical features 
(e.g. major), we used an indicator variable for each possible 
feature value. Some of these datasets and the source code 
of our algorithms are available at http://snap.stanford.edu. 
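The two preprocessing rules described above can be sketched with the Python standard library. The function names are hypothetical, and the ordering of the indicator columns (sorted category values) is a choice made for this example only.

```python
import statistics

def binarize_by_median(values):
    """Map a discrete-valued feature to 1 if strictly above the median, else 0."""
    med = statistics.median(values)
    return [1 if v > med else 0 for v in values]

def one_hot(values):
    """Return (categories, rows): one indicator column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows
```

For example, a school-year column `[1, 2, 3, 4]` has median 2.5 and becomes `[0, 0, 1, 1]`, while a major column with two distinct values becomes two binary indicator features.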

Predictive tasks. We investigate the predictive perfor- 
mance of the LMMG based on three different tasks. We 
visualize the three prediction tasks in Figure 4. Note that 
the column represents either features or nodes according to 
the type of the task. For each matrix, given 0/1 values in the 
white area, we predict the values of the entries with ques- 
tion marks. First, assuming that all node features of a given 
node are completely missing, we predict all the features 
based on the links of the node (Figure 4(a)). Second, when 
all the links of a given node are missing, we predict the 
missing links by using the node feature information
(Figure 4(b)). Last, we assume only a few features of a node are
missing and we perform the supervised classification of a 
specific node feature given all the other node features and 
the network (Figure 4(c)). 

Baseline models. Now we introduce natural baseline and 
state of the art methods. First, for the most basic baseline 
model, when predicting some missing value (node feature 
or link) of a given node, we average the corresponding val- 
ues of all the other nodes and regard it as the probability 
of value 1. We refer to this algorithm as AVG. Second, 
as we can view each of the three prediction tasks as the 
classification task, we use Collective Classification (CC) 
algorithms that exploit both node features and network de- 
pendencies (Sen et al., 2008). For the local classifier of CC 
algorithms, we use Naive-Bayes (CC-N) as well as logistic 
regression (CC-L). We also compare the LMMG to the state 
of the art Relational Topic Model (RTM) (Chang & Blei,
2009). We give further details about these models and how 
they were applied in the Appendix. 
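The AVG baseline described above amounts to a column mean that excludes the held-out node. A minimal sketch, with a hypothetical function name and a plain list standing in for one column of the feature or adjacency matrix:

```python
def avg_baseline(column, held_out):
    """Probability of value 1 for node `held_out`: mean over all other nodes."""
    others = [v for idx, v in enumerate(column) if idx != held_out]
    return sum(others) / len(others)
```

Since this baseline ignores both the links and the remaining features of the held-out node, it gives every held-out node nearly the same prediction for a given column.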

LL       AVG      CC-N     CC-L     RTM      LMMG
AH      -23.0    -17.6    -16.8    -63.4    -15.6
EGO      -5.4     -6.6     -5.1     -9.9     -3.7
FB       -8.7    -11.6     -8.9    -19.0     -7.4
WKB    -179.3   -186.8   -179.2   -336.8   -173.6

ACC      AVG      CC-N     CC-L     RTM      LMMG
AH       0.53     0.61     0.56     0.59     0.64
EGO      0.79     0.81     0.78     0.74     0.86
FB       0.77     0.76     0.75     0.77     0.80
WKB      0.88     0.88     0.89     0.88     0.90

Table 1. Prediction of missing node attributes. The LMMG
performs the best in terms of the log-likelihood as well as the
classification accuracy on the held-out data.

Task 1: Predicting missing node features. First, we
examine the performance for the task of predicting missing
features of a node where features of other nodes and all the
links are observed. We randomly select a node and remove
all the feature values of that node and try to recover them.
We quantify the performance by using the log-likelihood
of the true feature values over the estimated distributions
as well as the predictive accuracy (the probability of
correctly predicting the missing features) of each method.
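The two evaluation measures can be sketched as follows. The function names are hypothetical, and two details are assumptions of this example rather than statements from the paper: predicted probabilities are clipped away from 0 and 1 to keep the logarithm finite, and a prediction counts as correct when thresholding the probability at 0.5 matches the true value.

```python
import math

def held_out_loglik(y_true, probs, eps=1e-12):
    """Log-likelihood of the true 0/1 values under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(1.0 - eps, max(eps, p))     # clipping is an added safeguard
        total += math.log(p) if t == 1 else math.log(1.0 - p)
    return total

def accuracy(y_true, probs, threshold=0.5):
    """Fraction of entries whose thresholded prediction equals the truth."""
    hits = sum(1 for t, p in zip(y_true, probs) if (p >= threshold) == (t == 1))
    return hits / len(y_true)
```

The log-likelihood rewards well-calibrated probabilities, while accuracy only looks at which side of the threshold a prediction falls on, which is one reason the two measures can rank methods differently.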

Table 1 shows the results of the experiments by measuring
the average of log-likelihood (LL) and prediction accuracy
(ACC) for each algorithm and each dataset. We notice
that the LMMG model exhibits the best performance in the
log-likelihood for all datasets. While CC-L in general performs
the second best, our model outperforms it by up to 23%.
The performance gain over the other models in terms of
accuracy seems smaller when compared to the log-likelihood.
However, the LMMG model still predicts the missing node
features with the highest accuracy on all datasets.

In particular, the LMMG exhibits the most improvement in 
node feature prediction on the ego-network dataset (30% 
in LL and 7% in ACC) over the next best method. As the 
node features are derived by manually labeling community 
memberships of each person in the ego-network dataset, 
a certain group of people in the network intrinsically share 
some node feature (community membership). In this sense, 
the node features and the links in the ego-network are di- 
rectly related to each other and our model successfully ex- 
ploits this relationship to predict missing node features. 

Task 2: Predicting missing links. Second, we also con- 
sider the task of predicting the missing links of a specific 
node while the features of the node are given. Similarly to 
the previous task, we select a node at random, but here we 
remove all its links while observing its features. We then 
aim to recover the missing links. For evaluation, we use 
the log-likelihood (LL) of missing links as well as the area 
under the ROC curve (AUC) of missing link prediction. 
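The area under the ROC curve equals the probability that a randomly chosen true link receives a higher score than a randomly chosen non-link, with ties counted as one half. The sketch below uses this pairwise formulation; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking, so AUC values below 0.5 indicate a ranking that is worse than chance.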

We give the experimental results for each dataset in
Table 2. Again, the LMMG outperforms the baseline models
in the log-likelihood except for the Facebook100 data.
Interestingly, while RTM was relatively competitive when
predicting missing features, it tends to fail at predicting
missing links, which implies that the flexibility of link-affinity
matrices is needed for accurate modeling of the links.

LL       AVG      CC-N     CC-L     RTM      LMMG
AH      -40.2    -57.2    -38.9   -100.6    -36.1
EGO    -142.7   -134.3   -157.6   -149.9   -125.9
FB     -320.8   -330.7   -345.6   -359.1   -328.3
WKB     -54.2   -185.5    -39.6    -25.8    -13.7

AUC      AVG      CC-N     CC-L     RTM      LMMG
AH       0.51     0.69     0.39     0.56     0.72
EGO      0.61     0.89     0.55     0.49     0.89
FB       0.73     0.70     0.57     0.46     0.73
WKB      0.70     0.86     0.55     0.50     0.89

Table 2. Prediction of missing links of a node. The LMMG
performs best in all but one case.

We observe that Collective Classification methods look
competitive in some performance metrics and datasets. For
example, CC-N gives good results in terms of classification
accuracy, and CC-L performs well in terms of the
log-likelihood. As CC-N is a discriminative model, it does not
perform well in missing link probability estimation. However,
the LMMG is a generative model that produces a joint
probability of node features and network links, so it is also
very good at estimating missing links. Hence, overall,
the LMMG nicely exploits the relationship between the
network structure and node features to predict missing links.

Task 3: Supervised node classification. Finally, we 
examine the performance on the supervised classification 
task. In many cases, we aim to classify entities (nodes) 
based on their feature values under the supervised setting. 
Here the relationships (links) between the entities are also 
provided. For this experiment, we hold out one feature of 
nodes as the output class, regarding all other features of 
nodes and the network as input data. We divide the nodes 
into a 70% training and 30% test set. Similarly, we mea- 
sure the average of the log-likelihood (LL) as well as the 
average classification accuracy (ACC) on the test set. 

We illustrate the performance of various models in Table 3. 
The LMMG model performs better than the other mod- 
els in both the log-likelihood and the classification accu- 
racy. It improves the performance by up to 20% in the log- 
likelihood and 5% in the classification accuracy. We also 
notice that exploiting the relationship between node fea- 
tures and global network structure can improve the perfor- 
mance on supervised node classification compared to the 
models focusing on the local network dependencies (e.g., 
Collective Classification methods). 
LL       AVG      CC-N     CC-L     RTM      LMMG
AH      -84.5   -486.6    -60.5   -236.0    -55.3
EGO     -24.8    -54.0    -22.2    -41.7    -21.2
FB      -97.6   -254.6    -79.2   -181.7    -63.4
WKB     -17.5   -254.6    -15.4   -193.6    -15.0

ACC      AVG      CC-N     CC-L     RTM      LMMG
AH       0.52     0.58     0.63     0.51     0.63
EGO      0.76     0.76     0.77     0.75     0.79
FB       0.69     0.71     0.77     0.72     0.77
WKB      0.82     0.81     0.84     0.84     0.85

Table 3. Supervised node classification. The LMMG gives the
best performance on both metrics and all four datasets.

Case study: Analysis of a Facebook ego-network. Now
we qualitatively analyze the Facebook ego-network example
to provide insights into the relationship between node
features and network structure. We examine the estimated
model parameters W (for features) and $\Theta$ (for network
structure). By investigating model parameters (W and $\Theta$),
we can find not only what features are important for each
group but also how each group affects the link structure.

We begin by introducing the user whose Facebook friends
we used to create the network. We asked
our user to label each of his friends with a number of la- 
bels. He chose to use 14 different labels. They corre- 
spond to his high school (HS), undergraduate university 
(University), math olympiad camp (Camp), computer 
programming club (KProg) and work place (KComp) 
friends. The user also assigned labels to identify friends 
from his graduate program (CS) and university (ST), bas- 
ketball (Basketball) and squash (Squash) clubs, as 
well as travel mates (Travel), summer internship buddies 
(Intern), family (Family) and age group (Age). 

We fit the LMMG to the ego-network and each friend's
memberships to the above communities. We obtained the
model parameters W and $\Theta$. For the validation procedure,
we set the number of latent groups to 5 since the previous
prediction tasks worked well when K = 5. In Table 4, for
each of 5 latent groups, we represent the top 3 features with
the largest absolute value of model parameter $|w_{lk}|$ and the
corresponding link-affinity matrices $\Theta_k$.

We begin by investigating the first group. The three labels
most correlated with the first group are ST, Age, and
Intern. However, notice that Intern is negatively correlated.
This means that group 1 contains students from the
same graduate school and age, but not people with whom
our user worked together at the summer internship (even
though they may be of the same school/age). We also note
that $\Theta_1$ exhibits homophily structure. From this we learn
that summer interns, who met our Facebook user neither
because of shared graduate school nor because of the age,
form a group within which people are densely connected.
On the other hand, people of the same age at the same
university also exhibit homophily, but are less densely
connected with each other. Such variation in link density that
depends on the group memberships agrees with our intuition.
Those who worked at the same company actively interact
with each other so almost everyone is linked in Facebook.
However, as the group of people of the same university
or age is large and each pair of people in that group
does not necessarily know each other, the link affinity in
this group is naturally smaller than in the interns' group.

Similarly, groups 2 and 3 form the two sports groups 
(Basketball, Squash). People are connected densely
within each of the groups, but less connected to the outside 
of the groups. This is natural because the sports clubs make 
members actively interact with each other but do not nec- 
essarily make members interact with those not in the clubs. 
Furthermore, we notice that those who graduated from not 
only the same high school (HS) but also the same under- 
graduate school (University) form another community 
but the membership to high school is more important than 
to the undergraduate university (8.7 vs. 2.3). 

However, for groups 4 and 5, we note that the correspond- 
ing link-affinity matrices are nearly flat (i.e. values are 
nearly uniform). This implies that groups 4 and 5 are re- 
lated to general node features. In this sense, we hypothesize 
that features like CS, family, math camp, and the company, 
have relatively little effect on the network structure. 

5. Related Work and Discussion 

The LMMG builds on previous research in machine learn- 
ing and network analysis. Many models have been de- 
veloped to explain network link structure (Airoldi et al., 
2007; Hoff et al., 2002; Kemp et al., 2006; Leskovec et al., 
2010) and extensions that incorporate node features have also been proposed (Getoor et al., 2001; Kim & Leskovec, 2011b; Taskar et al., 2003). However, these models do not consider latent groups and thus cannot provide the benefits of dimensionality reduction or produce interpretable clusters useful for understanding network community structure.

The LMMG provides meaningful clustering of nodes and their features in the network. Network models of a similar flavor have been proposed in the past (Airoldi et al., 2007; Hoff et al., 2002; Kemp et al., 2006), and some even incorporate node features (Chang & Blei, 2009; Nallapati et al., 2008; Miller et al., 2009). However, such
models have been mainly developed for document net- 
works where they assume the multinomial topic distribu- 
tions for each word in the document. We extend this by 
learning a logistic model for occurrence of each feature 
based on node group memberships. To highlight the difference between the previous models and ours: since topic memberships in the above models are modeled by multinomial distributions, a node has a total mass of 1 to split among the various topics. In contrast, in the LMMG, a node can belong to multiple groups at once without any constraint.

Group  Top 1              Top 2              Top 3             Link-affinity matrix
1      ST (9.0)           Age (4.5)          Intern (-3.7)     [0.67 0.08; 0.08 0.17]
2      HS (-8.7)          University (-2.3)  Basketball (2.2)  [0.26 0.18; 0.18 0.38]
3      University (-7.1)  KorST (-2.6)       Squash (2.2)      [0.22 0.23; 0.23 0.32]
4      CS (7.3)           Family (7.0)       Camp (6.9)        [0.25 0.24; 0.24 0.27]
5      KComp (5.2)        KorST (4.4)        Intern (-3.8)     [0.29 0.22; 0.22 0.27]

Table 4. Logistic model parameter values of top 3 features and the link-affinity matrix associated with each group in the ego-network.

While previous work tends to explore only the network or 
only the features, the LMMG jointly models both so that it 
can make predictions on one given the other. The LMMG 
models the interaction between links and group member- 
ships via link-affinity matrices which provide great flexibil- 
ity and interpretability of obtained groups and interactions. 

The LMMG is a new probabilistic model of links and nodes 
in networks. It can be used for link prediction, node fea- 
ture prediction and supervised node classification. We 
have demonstrated qualitatively and quantitatively that the 
LMMG proves useful for analyzing network data. The 
LMMG significantly improves on previous models, inte- 
grating both node-specific information and link structure 
to give better predictions. 
A. Mathematical Details

A.1. Update of Group Membership $\phi$

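In Equation (9), we proposed the gradient ascent method, which updates each group membership $\phi_{ik}$ to maximize the lower bound of the log-likelihood $\mathcal{L}$. To complete its computation, we take a closer look at $\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \phi_{ik}}$ and $\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \phi_{ik}}$ in detail. Then, we can also compute $\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ji}}{\partial \phi_{ik}}$ and $\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ji})}{\partial \phi_{ik}}$ in the same way.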

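First, we calculate the derivative of the expected log-likelihood for edges, $\mathbb{E}_{Z \sim \phi} \log p_{ij}$. When all the group memberships except for $\phi_{ik}$ are fixed, we can derive $\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \phi_{ik}}$ from the definition of $p_{ij}$ in Equation (3) as follows:

$$\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \phi_{ik}} = \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \Big[ \sum_{k'} \log \Theta_{k'}[z_{ik'}, z_{jk'}] \Big] = \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \log \Theta_k[z_{ik}, z_{jk}] \qquad (13)$$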

Here we use the following property. Since $z_{ik}$ is an independent Bernoulli random variable with probability $\phi_{ik}$, for any function $f : \{0, 1\}^2 \to \mathbb{R}$,

$$\mathbb{E}_{Z \sim \phi} f(z_{ik}, z_{jk}) = \phi_{ik} \phi_{jk} f(1, 1) + \phi_{ik} (1 - \phi_{jk}) f(1, 0) + (1 - \phi_{ik}) \phi_{jk} f(0, 1) + (1 - \phi_{ik})(1 - \phi_{jk}) f(0, 0). \qquad (14)$$
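Equation (14) is just the expectation of a function of two independent Bernoulli variables, and can be sketched directly. The function name `expect_pair` and the example link-affinity values below are illustrative assumptions, not taken from the paper.

```python
def expect_pair(f, phi_i, phi_j):
    """E[f(z_i, z_j)] for independent z_i ~ Bernoulli(phi_i), z_j ~ Bernoulli(phi_j)."""
    return (phi_i * phi_j * f(1, 1)
            + phi_i * (1 - phi_j) * f(1, 0)
            + (1 - phi_i) * phi_j * f(0, 1)
            + (1 - phi_i) * (1 - phi_j) * f(0, 0))


# Hypothetical 2x2 link-affinity matrix, indexed as theta_k[x1][x2].
theta_k = [[0.3, 0.1], [0.1, 0.6]]
expected_affinity = expect_pair(lambda a, b: theta_k[a][b], 0.8, 0.5)
print(expected_affinity)
```

Passing `f = lambda a, b: theta_k[a][b]` gives the expected link affinity of a node pair for one group; passing the logarithm of the entry gives the quantity differentiated in Equation (13).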

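Hence, by applying Equation (14) to (13), we obtain

$$\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \phi_{ik}} = \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \log \Theta_k[z_{ik}, z_{jk}] = \phi_{jk} \log \Theta_k[1, 1] + (1 - \phi_{jk}) \log \Theta_k[1, 0] - \phi_{jk} \log \Theta_k[0, 1] - (1 - \phi_{jk}) \log \Theta_k[0, 0]. \qquad (15)$$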
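As a sanity check, the closed form in Equation (15) can be compared against a finite-difference derivative of the expected log-affinity. The sketch below assumes a hypothetical link-affinity matrix and membership probabilities; the function names are illustrative only.

```python
import math


def expected_log_affinity(theta_k, phi_ik, phi_jk):
    """E[log Theta_k[z_ik, z_jk]] under independent Bernoulli memberships."""
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pa = phi_ik if a else 1 - phi_ik
            pb = phi_jk if b else 1 - phi_jk
            total += pa * pb * math.log(theta_k[a][b])
    return total


def grad_edge_wrt_phi_ik(theta_k, phi_jk):
    """Right-hand side of Equation (15); it does not depend on phi_ik."""
    log = math.log
    return (phi_jk * log(theta_k[1][1]) + (1 - phi_jk) * log(theta_k[1][0])
            - phi_jk * log(theta_k[0][1]) - (1 - phi_jk) * log(theta_k[0][0]))


theta_k = [[0.3, 0.1], [0.1, 0.6]]  # hypothetical values
h = 1e-6
numeric = (expected_log_affinity(theta_k, 0.4 + h, 0.7)
           - expected_log_affinity(theta_k, 0.4 - h, 0.7)) / (2 * h)
analytic = grad_edge_wrt_phi_ik(theta_k, 0.7)
print(numeric, analytic)
```

Because the expectation is linear in each membership probability, the two values agree up to floating-point rounding.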

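Next, we compute the derivative of the expected log-likelihood for unlinked node pairs, i.e. $\mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})$. Here we approximate the computation using the Taylor expansion $\log(1 - x) \approx -x - 0.5 x^2$ for small $x$:

$$\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \phi_{ik}} \approx \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \left( -p_{ij} - 0.5\, p_{ij}^2 \right)$$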

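To compute $\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}}{\partial \phi_{ik}}$, note that

$$\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}}{\partial \phi_{ik}} = \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \prod_{k'} \Theta_{k'}[z_{ik'}, z_{jk'}] = \frac{\partial}{\partial \phi_{ik}} \Big( \mathbb{E}_{Z \sim \phi} \Theta_k[z_{ik}, z_{jk}] \prod_{k' \neq k} \mathbb{E}_{Z \sim \phi} \Theta_{k'}[z_{ik'}, z_{jk'}] \Big) = \prod_{k' \neq k} \mathbb{E}_{Z \sim \phi} \Theta_{k'}[z_{ik'}, z_{jk'}] \; \frac{\partial}{\partial \phi_{ik}} \mathbb{E}_{Z \sim \phi} \Theta_k[z_{ik}, z_{jk}]$$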



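By Equation (14), each $\mathbb{E}_{Z \sim \phi} \Theta_k[z_{ik}, z_{jk}]$ and its derivative can be obtained. Similarly, we can calculate $\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}^2}{\partial \phi_{ik}}$, so we complete the computation of $\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \phi_{ik}}$.

As we attain $\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \phi_{ik}}$ and $\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \phi_{ik}}$, we can eventually calculate $\frac{\partial \mathcal{L}_A}{\partial \phi_{ik}}$. Hence, by adding up the derivatives with respect to $\phi_{ik}$ of all the components of the lower bound (including the node-feature term $\mathcal{L}_F$ and the network term $\mathcal{L}_A$), we complete computing the derivative of the lower bound of the log-likelihood, $\frac{\partial \mathcal{L}}{\partial \phi_{ik}}$.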

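The non-edge derivative described in Section A.1 can be sketched as below. This is a minimal illustration under stated assumptions: the helper names are hypothetical, link-affinity matrices are indexed as `theta[x1][x2]`, and $\mathbb{E}\, p_{ij}^2$ is factorized across groups in the same way as $\mathbb{E}\, p_{ij}$ (valid because memberships are independent across groups), using squared entries.

```python
def bern(phi, x):
    """P(z = x) for z ~ Bernoulli(phi)."""
    return phi if x == 1 else 1.0 - phi


def e_theta(theta_k, phi_ik, phi_jk, power=1):
    """E[Theta_k[z_ik, z_jk] ** power], computed via Equation (14)."""
    return sum(bern(phi_ik, a) * bern(phi_jk, b) * theta_k[a][b] ** power
               for a in (0, 1) for b in (0, 1))


def de_theta(theta_k, phi_jk, power=1):
    """Derivative of e_theta with respect to phi_ik."""
    t = [[v ** power for v in row] for row in theta_k]
    return (phi_jk * t[1][1] + (1 - phi_jk) * t[1][0]
            - phi_jk * t[0][1] - (1 - phi_jk) * t[0][0])


def approx_log1m(thetas, phi_i, phi_j):
    """Taylor approximation of E[log(1 - p_ij)]: -E[p_ij] - 0.5 E[p_ij^2]."""
    value = 0.0
    for power, coef in ((1, -1.0), (2, -0.5)):
        prod = 1.0
        for t, a, b in zip(thetas, phi_i, phi_j):
            prod *= e_theta(t, a, b, power)
        value += coef * prod
    return value


def grad_approx_log1m(thetas, phi_i, phi_j, k):
    """Derivative of approx_log1m with respect to phi_i[k] (product rule)."""
    grad = 0.0
    for power, coef in ((1, -1.0), (2, -0.5)):
        rest = 1.0
        for kk, (t, a, b) in enumerate(zip(thetas, phi_i, phi_j)):
            if kk != k:
                rest *= e_theta(t, a, b, power)
        grad += coef * rest * de_theta(thetas[k], phi_j[k], power)
    return grad


# Hypothetical parameters for K = 2 groups.
thetas = [[[0.3, 0.1], [0.1, 0.6]], [[0.2, 0.4], [0.4, 0.5]]]
phi_i, phi_j = [0.4, 0.9], [0.7, 0.2]
print(grad_approx_log1m(thetas, phi_i, phi_j, k=0))
```

Only the factor for group `k` depends on `phi_i[k]`, so the remaining factors act as a constant multiplier, mirroring the product over $k' \neq k$ in the derivation.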


A.2. Update of MAG Model Parameters $\Theta$

Next we focus on the update of the parameters of the network model, $\Theta$, where the group membership $\phi$ is fixed. Since the network model is independent of the node attribute model given the group membership $\phi$, we do not need to consider the other terms of the lower bound or the regularizer $|W|_1$. We thus update $\Theta$ to maximize only $\mathcal{L}_A$ given $\phi$ using the gradient method.

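As we previously did in computing $\frac{\partial \mathcal{L}_A}{\partial \phi_{ik}}$ by separating edge and non-edge terms, we compute each $\frac{\partial \mathcal{L}_A}{\partial \Theta_k[x_1, x_2]}$ for $k = 1, \dots, K$ and $x_1, x_2 \in \{0, 1\}$. To describe it mathematically,

$$\frac{\partial \mathcal{L}_A}{\partial \Theta_k[x_1, x_2]} = \sum_{A_{ij} = 1} \frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \Theta_k[x_1, x_2]} + \sum_{A_{ij} = 0} \frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \Theta_k[x_1, x_2]}. \qquad (16)$$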



Now we compute each term in the above calculation by the definition of $p_{ij}$. First, we compute the former term by using Equation (14). For instance,

$$\frac{\partial \mathbb{E}_{Z \sim \phi} \log p_{ij}}{\partial \Theta_k[0, 1]} = (1 - \phi_{ik}) \phi_{jk} \frac{\partial \log \Theta_k[0, 1]}{\partial \Theta_k[0, 1]} = \frac{(1 - \phi_{ik}) \phi_{jk}}{\Theta_k[0, 1]}.$$

Hence, we can properly compute Equation (16) depending on the values of $x_1$ and $x_2$.
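The edge term of Equation (16) generalizes from the $[0, 1]$ example to any entry: the derivative is the probability of the membership pair divided by the entry itself. The sketch below checks this against a finite difference; the function names and parameter values are illustrative assumptions.

```python
import math


def bern(phi, x):
    """P(z = x) for z ~ Bernoulli(phi)."""
    return phi if x == 1 else 1.0 - phi


def expected_log_affinity(theta_k, phi_ik, phi_jk):
    """E[log Theta_k[z_ik, z_jk]] via Equation (14)."""
    return sum(bern(phi_ik, a) * bern(phi_jk, b) * math.log(theta_k[a][b])
               for a in (0, 1) for b in (0, 1))


def grad_edge_wrt_theta(theta_k, phi_ik, phi_jk, x1, x2):
    """d E[log p_ij] / d Theta_k[x1, x2] for a linked pair (i, j)."""
    return bern(phi_ik, x1) * bern(phi_jk, x2) / theta_k[x1][x2]


# Hypothetical values; perturb the [0][1] entry to check the derivative.
theta_k = [[0.3, 0.1], [0.2, 0.6]]
h = 1e-6
plus = [[0.3, 0.1 + h], [0.2, 0.6]]
minus = [[0.3, 0.1 - h], [0.2, 0.6]]
numeric = (expected_log_affinity(plus, 0.4, 0.7)
           - expected_log_affinity(minus, 0.4, 0.7)) / (2 * h)
analytic = grad_edge_wrt_theta(theta_k, 0.4, 0.7, 0, 1)
print(numeric, analytic)
```

For the values above the analytic result is $(1 - 0.4) \cdot 0.7 / 0.1 = 4.2$.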

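Second, we use the same Taylor expansion technique for the latter term in Equation (16) as follows:

$$\frac{\partial \mathbb{E}_{Z \sim \phi} \log (1 - p_{ij})}{\partial \Theta_k[x_1, x_2]} \approx \frac{\partial}{\partial \Theta_k[x_1, x_2]} \mathbb{E}_{Z \sim \phi} \left( -p_{ij} - 0.5\, p_{ij}^2 \right)$$

Similarly to $\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}}{\partial \phi_{ik}}$, the derivative $\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}}{\partial \Theta_k[x_1, x_2]}$ is computed by

$$\prod_{k' \neq k} \mathbb{E}_{Z \sim \phi} \Theta_{k'}[z_{ik'}, z_{jk'}] \; \frac{\partial}{\partial \Theta_k[x_1, x_2]} \mathbb{E}_{Z \sim \phi} \Theta_k[z_{ik}, z_{jk}]$$

where each term is obtained by Equation (14). Similarly, we compute $\frac{\partial \mathbb{E}_{Z \sim \phi}\, p_{ij}^2}{\partial \Theta_k[x_1, x_2]}$ so that we can obtain $\frac{\partial \mathcal{L}_A}{\partial \Theta_k[x_1, x_2]}$.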
B. Implementation Details
B.1. Initialization

Since the objective function in Equation (8) is non-convex, the final solution might depend on the initial values of $\phi$, $W$, and $\Theta$. For a reasonable initialization, as the node attributes $F$ are given, we run the Singular Value Decomposition (SVD) by regarding $F$ as an $N \times L$ matrix and obtain the singular vectors corresponding to the top $K$ singular values. By taking the top $K$ components, we can approximate the node attributes $F$ over $K$ latent dimensions. We thus assign the $l$-th entry of the $k$-th right singular vector multiplied by the $k$-th singular value to $w_{lk}$ for $l = 1, \dots, L$ and $k = 1, \dots, K$. We also initialize each group membership $\phi_{ik}$ based on the $i$-th entry of the $k$-th left singular vector. This approximation provides good enough initial values in particular when the top $K$ singular values dominate the others. In order to obtain a sparse model parameter $W$, we reassign 0 to each $w_{lk}$ of small absolute value such that $|w_{lk}| < \lambda$.
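A minimal sketch of this SVD-based initialization, using NumPy, might look as follows. The text only says that $\phi_{ik}$ is "based on" the left singular vectors, so the min-max rescaling into $(0, 1)$ used here is an assumption, as are the function name `svd_init` and the `eps` parameter.

```python
import numpy as np


def svd_init(F, K, lam, eps=1e-3):
    """Initialize (phi, W) from an N x L binary feature matrix F.

    W[l, k] = s_k * V[l, k], with entries of magnitude below lam set to 0.
    phi[i, k] is the k-th left singular vector rescaled per column into
    [eps, 1 - eps]; this rescaling is an assumption, not from the paper.
    """
    F = np.asarray(F, dtype=float)
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    W = (Vt[:K, :] * s[:K, None]).T          # shape L x K
    W[np.abs(W) < lam] = 0.0                 # sparsify small weights
    U_k = U[:, :K]
    lo, hi = U_k.min(axis=0), U_k.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    phi = np.clip((U_k - lo) / span, eps, 1 - eps)
    return phi, W


# Hypothetical feature matrix with N = 4 nodes and L = 3 features.
F = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
phi, W = svd_init(F, K=2, lam=0.1)
print(phi.shape, W.shape)
```

Note that singular vectors are only defined up to sign, so a different SVD implementation may flip the sign of a column of `W` together with the orientation of the corresponding column of `phi`.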

Finally, we initialize the link-affinity matrices $\Theta$ as follows. When initializing the $k$-th link-affinity matrix $\Theta_k$, we assume that the groups other than group $k$ have nothing to do with the network structure, i.e. every entry in the other link-affinity matrices has the same value. Then, we compute the ratio between the entries $\Theta_k[x_1, x_2]$ for $x_1, x_2 \in \{0, 1\}$ as follows:

$$\Theta_k[x_1, x_2] \propto \sum_{i, j : A_{ij} = 1} \mathbb{E}_{Z \sim \phi} P[z_{ik} = x_1, z_{jk} = x_2]$$

As the group membership $\phi$ is initialized above and $z_{ik}$ and $z_{jk}$ are independent of each other, we are able to compute the ratio between the entries of $\Theta_k$. After computing the ratio between entries for each link-affinity matrix, we adjust the scale of the link-affinity matrices so that the expected number of edges in the MAG model is equal to the number of edges in the given network, i.e. $\sum_{i,j} p_{ij} = \sum_{i,j} A_{ij}$.
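The link-affinity initialization can be sketched as below. The text does not say how the scale is adjusted, so applying one common scalar to all $K$ matrices is an assumption of this sketch; function names and the toy graph are also hypothetical, edges are treated as ordered pairs, and no clipping of entries to $[0, 1]$ is performed.

```python
def bern(phi, x):
    """P(z = x) for z ~ Bernoulli(phi)."""
    return phi if x == 1 else 1.0 - phi


def expected_edges(thetas, phi):
    """Sum over ordered pairs i != j of E[p_ij] = prod_k E[Theta_k[z_ik, z_jk]]."""
    n = len(phi)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p = 1.0
            for k, t in enumerate(thetas):
                p *= sum(bern(phi[i][k], a) * bern(phi[j][k], b) * t[a][b]
                         for a in (0, 1) for b in (0, 1))
            total += p
    return total


def init_thetas(edges, phi):
    """Entry ratios from membership probabilities on edges, then a global rescale."""
    K = len(phi[0])
    thetas = []
    for k in range(K):
        m = [[0.0, 0.0], [0.0, 0.0]]
        for i, j in edges:
            for a in (0, 1):
                for b in (0, 1):
                    m[a][b] += bern(phi[i][k], a) * bern(phi[j][k], b)
        norm = sum(m[0]) + sum(m[1])
        thetas.append([[v / norm for v in row] for row in m])
    # Scaling every matrix by c multiplies each E[p_ij] by c ** K.
    scale = (len(edges) / expected_edges(thetas, phi)) ** (1.0 / K)
    return [[[scale * v for v in row] for row in t] for t in thetas]


# Hypothetical toy network: 3 nodes, 4 directed edges, K = 2 groups.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
phi = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]]
thetas = init_thetas(edges, phi)
print(expected_edges(thetas, phi))
```

On very small or dense toy graphs the rescaled entries can exceed 1; a practical implementation would need to handle that case.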

B.2. Selection of the Number of Groups K 

Another issue in fitting the LMMG to the given network and node feature data is determining the number of groups, $K$. We can gain insight into the value of $K$ from the MAG model. It has already been shown that, in order for the MAG model to reasonably represent a real-world network, the value of $K$ should be on the order of $\log N$, where $N$ is the number of nodes in the network (Kim & Leskovec, 2012). Since in the LMMG the network links are modeled similarly to the MAG model, the same argument on the number of groups $K$ still holds.

However, the above argument cannot determine a specific value of $K$. To select one value of $K$, we use cross-validation as follows. For instance, suppose that we aim to predict all the features of a node whose links to the other nodes are fully observed (Task 1 in Section 4). While holding out the test node, we can set up the same prediction task by selecting one node at random from the other nodes (training nodes) and regarding it as the validation test node. We then perform missing node feature prediction on this validation node and obtain the log-likelihood. By repeating this procedure while varying the validation test node, we obtain the average log-likelihood on the missing node features for the specific value of $K$ (i.e. N-fold cross-validation). Finally, we compare the average log-likelihood values across the values of $K$ and pick the one that maximizes the log-likelihood. The same procedure applies to the other prediction tasks, missing link prediction and supervised node classification.
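The selection loop above can be sketched as follows. The callback `heldout_loglik(K, v)` is hypothetical: it stands for fitting the model with $K$ groups while treating training node `v` as the validation node and returning the log-likelihood of its held-out features. The toy scoring function at the end is made up purely to exercise the loop.

```python
def select_k(candidates, train_nodes, heldout_loglik):
    """Pick the K with the highest average validation log-likelihood."""
    best_k, best_avg = None, float("-inf")
    for K in candidates:
        scores = [heldout_loglik(K, v) for v in train_nodes]
        avg = sum(scores) / len(scores)
        if avg > best_avg:
            best_k, best_avg = K, avg
    return best_k


# Made-up scoring function that peaks at K = 5, for illustration only.
toy_score = lambda K, v: -(K - 5) ** 2 - 0.1 * v
print(select_k([3, 5, 7], train_nodes=[0, 1, 2], heldout_loglik=toy_score))
```

Restricting `candidates` to values around $\log N$ follows the argument on the order of $K$ and keeps the number of model fits manageable.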

B.3. Baseline Models 

Here we briefly describe how we implemented each base- 
line method depending on the type of prediction task. 

AVG. In this baseline method, we regard each $l$-th node feature and each link to the $i$-th node as an independent random variable. In other words, we assume that missing node features or links do not depend on each other. Hence, we predict the $l$-th missing node feature by computing the fraction of all the other nodes whose $l$-th node feature has value 1. We then regard this fraction as the probability that the missing $l$-th node feature takes value 1.

Similarly, when we predict the missing links (in particular, the link to the $i$-th node) of a given node, we average over all the other nodes the probability that they are linked to the $i$-th node and take it as the probability of a link from the given node to the $i$-th node (i.e. preferential attachment).
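A minimal sketch of the AVG baseline is given below. The function names and data layout are assumptions; in particular, excluding the target node $i$ itself from the average when predicting a link to $i$ is a choice made here, not something the text specifies.

```python
def avg_feature_probs(F, target):
    """P(feature l = 1) for the target node: fraction of other nodes with F[u][l] = 1."""
    others = [row for u, row in enumerate(F) if u != target]
    return [sum(col) / len(others) for col in zip(*others)]


def avg_link_probs(A, source):
    """P(link source -> i): fraction of other nodes u (u != source, i) with A[u][i] = 1."""
    n = len(A)
    probs = {}
    for i in range(n):
        if i == source:
            continue
        others = [u for u in range(n) if u not in (source, i)]
        probs[i] = sum(A[u][i] for u in others) / len(others) if others else 0.0
    return probs


# Hypothetical data: 3 nodes with 2 binary features, and a 4-node directed graph.
F = [[1, 0], [1, 1], [0, 1]]
A = [[0, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 1, 1, 0]]
print(avg_feature_probs(F, target=0))
print(avg_link_probs(A, source=0))
```

Because the link probability depends only on how many other nodes already point to $i$, high in-degree nodes receive high scores, which is the preferential-attachment behavior mentioned above.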

CC-N. For this method, we use the Naive Bayes method with the node features of each node as well as those of its neighboring nodes. To represent each node feature of the neighboring nodes by a single value, we select the majority value (either 0 or 1) among the neighbors' feature values.

However, we cannot use the node's own features when predicting all the node features of a given node. Furthermore, the node features of neighboring nodes are unavailable when we predict missing links. Therefore, depending on the type of prediction task, we exploit only the available information among the node's own features and those of neighboring nodes.

CC-L. We employ a similar approach to CC-N. However, here we use logistic regression rather than Naive Bayes and average the feature values of the neighboring nodes rather than picking the majority value.
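The two neighbor summaries used by CC-N and CC-L differ only in how neighbor feature values are aggregated, which the following sketch illustrates. The function names are hypothetical, and sending ties (and empty neighborhoods) to 0 in the majority vote is an assumption, since the text does not specify tie-breaking. The classifiers themselves (Naive Bayes, logistic regression) are not shown.

```python
def neighbor_majority(F, neighbors, l):
    """CC-N style summary: majority value of feature l among neighbors (ties -> 0)."""
    vals = [F[j][l] for j in neighbors]
    return 1 if 2 * sum(vals) > len(vals) else 0


def neighbor_average(F, neighbors, l):
    """CC-L style summary: mean value of feature l among neighbors (0.0 if none)."""
    vals = [F[j][l] for j in neighbors]
    return sum(vals) / len(vals) if vals else 0.0


# Hypothetical binary features for 4 nodes.
F = [[1, 0], [1, 1], [0, 1], [0, 0]]
print(neighbor_majority(F, [0, 1, 2], 0), neighbor_average(F, [0, 1, 2], 0))
```

The averaged summary keeps more information than the majority vote, for example distinguishing a 2-of-3 neighborhood from a unanimous one.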

RTM. We use the lda-R package to run RTM (http://cran.r-project.org/web/packages/lda/index.html).



